Using the Web as Corpus for Linguistic Research

Author

Martin Volk

Abstract

Over the last decade, working methods in Computational Linguistics have changed drastically. Fifteen years ago, most research focused on selected example sentences. Nowadays, access to and exploitation of large text corpora is commonplace. This shift is reflected in a renaissance of work in Corpus Linguistics and is documented in a number of pertinent recent books, e.g. the introductions by Biber et al. (1998) and Kennedy (1998), and the more methodologically oriented works on statistics and programming in Corpus Linguistics by Oakes (1998) and Mason (2000).

Similar resources

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

Annotated Web As Corpus

This paper presents a proposal to facilitate the use of the annotated web as corpus by alleviating the annotation bottleneck for corpus data drawn from the web. We describe a framework for large-scale distributed corpus annotation using peer-to-peer (P2P) technology to meet this need. We also propose to annotate a large reference corpus in order to evaluate this framework. This will allow us to ...

The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese

In this paper, we discuss our creation of a web corpus of spoken Hindi (COSH), one of the Indo-Aryan languages spoken mainly in the Indian subcontinent. We also point out notable problems we’ve encountered in the web corpus and the special concordancer. After observing the kind of technical problems we encountered, especially regarding annotation tagged by Shiva Reddy’s tagger, we argue how the...

GoogleLing: The Web as a Linguistic Corpus

We describe software to transform any search engine or searchable corpus into a tool for linguistic research with a rich query syntax. We provide support for case-sensitive searches, within-sentence and within-N-words match constraints, part-of-speech restrictions on words, and “smart” verb-ending inflection wildcards. The software generalizes the query for the underlying search engine, and then...

Draft WebCorp: providing a renewable data source for corpus linguists

The many electronic text corpora available nowadays present ever fewer obstacles to a wide range of corpus linguistic study. However, corpora are expensive resources to create and to update, and there remain problems for linguists if they seek access to very large, very recent, or changing language. The World Wide Web, whilst intended as an information source, is an obvious resource for the ret...
